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Abstract 

Background: Mass spectrometry-based metabolomic analysis depends upon the identification of spectral peaks by 
their nnass and retention time. Statistical analysis that follows the identification currently relies on one main peak of 
each compound. However, a compound present in the sample typically produces several spectral peaks due to its 
isotopic properties and the ionization process of the mass spectrometer device. In this work, we investigate the extent 
to which these additional peaks can be used to increase the statistical strength of differential analysis. 

Results: We present a Bayesian approach for integrating data of multiple detected peaks that come from one 
compound. We demonstrate the approach through a simulated experiment and validate it on ultra performance 
liquid chromatography-mass spectrometry (UPLC-MS) experiments for metabolomics and lipidomics. Peaks that are 
likely to be associated with one compound can be clustered by the similarity of their chromatographic shape. 
Changes of concentration between sample groups can be inferred more accurately when multiple peaks are available. 

Conclusions: When the sample-size is limited, the proposed multi-peak approach improves the accuracy at inferring 
covariate effects. An R implementation and data are available at http://research.ics.aalto.fi/mi/software/peakANOVA/. 

Keywords: ANOVA-type modeling, Bayesian modeling. Clustering, Mass spectrometry, Metabolomics, Lipidomics, 
Nonparametric Bayes 



Background 

The study of changes in the levels of metabolites and lipids 
has become essential for the comprehensive understand- 
ing of human health [1]. Chromatography-coupled mass 
spectrometry (MS) techniques have become the standard 
method for characterizing the human metabolome [2] 
and lipidome [3]. The technique generates a spectrum of 
peaks describing the sample in the plane defined by the 
retention time from the chromatograph and the mass-to- 
charge ratio from the mass spectrometer. Each peak in 
this plane is either generated by an ion arising from one 
of the compounds present in the sample, or is an arti- 
fact of the measurement without association to any of 
the compounds. The association between the peaks and 
compounds is unknown a priori. The produced peak data 
are noisy: First, sample preparation introduces sources of 
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uncertainty that propagate to the analysis [4]. Second, the 
accuracy of the device is limited [5] and it produces biases. 
Third, peak identification, annotation and pre-processing 
steps produce additional layers of uncertainty [6]. The 
effect of errors at all these levels is exacerbated by the 
"small n, large p' problem: experiments cover a very lim- 
ited number of samples, n, while the set of compounds 
measured, p, is potentially large. 

However, there also is strong informative struc- 
ture in the data: First, each compound may gener- 
ate multiple adduct peaks [7] (Figure 1) and isotope 
peaks [8,9] (Figure 2), whose positions and shapes pro- 
vide information about the identity of the compound. 
Second, the concentrations of different compounds gen- 
erated by or participating in similar biological processes 
may be highly correlated [10]. An increasing number 
of machine learning algorithms are being developed for 
inferring such structure either from raw spectral data [11] 
or from processed intensity data [12]. The inference of 
covariate effects — the differences between sample groups 
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Expected adduct peaks of two lipids 
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Figure 1 A schematic of the positions of typical adduct peaks [7] 
in the RT-m/z plane for two lipids, the ceramide Cer(d18:1/17:0) 
and the sphingomyelin SIVl(d1 8:1 722:0). An adduct peak is formed 
by an ion attacliing to tine compound. At tine finer detail, each peak in 
the figure consists of multiple isotope peaks few atomic units apart, 
as shown for Cer(dl 8:1/1 7:0) in Figure 2. Even though the distinct 
isotope peaks are not visible to the eye here, they are clearly 
separable by the mass spectrometer. In the figure, adduct types and 
compounds are marked by colors and characters, respectively. 
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Figure 2 Natural isotopic distribution of the mass of a typical 
lipid, the ceramide Cer(d1 8:1 /1 7:0). The presence of atomic 
isotopes leads to the appearance of multiple mass spectral peaks 
from the compound. Some isotopes are very similar by their mass but 
still differentiable by the mass spectrometer. The isotope peaks have 
distinct mass-to-charge ratios at the same retention time (Figure 1 ). 



determined by the controlled covariates of the experi- 
ment, such as an intervention — is in the core of the com- 
parative analysis of spectral profiles [13]. In addition to 
the controlled covariates, confounding factors may affect 
the observations and are subject to the experiment design. 
In this work, we focus on inferring effects of the controlled 
covariates from the data. 

The existence of additional peaks in the spectrum is 
usually seen as a problem and only the main peak of 
each identified compound is used for further analysis. 
All peaks are a result of the ionisation process where 
a charged particle is attached to or detached from a com- 
pound. Each such compound-ion pair produces a distinct 
adduct peak. Random variation in the ionisation process 
leads to inconsistencies between batches of samples, per- 
ceived as variation in the ratio of intensities of the peaks 
associated with one compound. This is a major source 
of error for all existing analysis approaches regardless of 
the choice of the peak used for the analysis. On the other 
hand, the distribution of the intensities of isotope peaks 
is by nature well preserved across both samples and com- 
pounds. Moreover, the natural isotopic distribution of 
a compound is known and can be used to make peak 
annotation more precise. In this way, isotope peaks pro- 
vide reliable additional information about the differences 
in compound concentrations between sample groups. 

We propose a probabilistic approach for extending sta- 
tistical analysis to all available peaks and demonstrate 
that the additional peaks can provide a real benefit to 
the inference of covariate effects (Figure 3). The approach 
is used to cluster the peaks that are likely to arise from 
a single compound together and to infer the changes in 
concentrations of the compounds more accurately based 
on all these peaks. By this approach, we are address- 
ing the problem of inadequate sample-size by introduc- 
ing additional data describing the compounds behind 
the noisy measurements. 

To solve the problem we introduce the following 
assumptions about the generative process of the data 
within a Bayesian model: First, samples carry between- 
group differences in their compound concentrations and 
the differences arise from responses to controlled covari- 
ates. Second, multiple observed spectral peaks follow 
an identical generative process and their heights are 
a noisy reflection of the true concentration level of 
the compound. Third, shapes of the peaks from one com- 
pound are generated through an identical process follow- 
ing the properties of the measurement device, and thus 
these shapes are highly similar. 

The approach presented in this paper consists of two 
stages of computational inference: (1) peaks that share 
a compound as their generative source are clustered 
together, and (2) the responses to controlled covariates of 
the experiment are inferred on these clusters of peaks. 
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a) Peak clustering based on shapes 
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b) Inference of covariate effects based on intensity 
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Figure 3 Flowchart of the method, (a) Peaks are clustered by their shapes, (b) Covariate effects are inferred based on the intensities of 
the clustered peaks. 



The clustering part of the approach is based on a non- 
parametric Bayesian Dirichlet process model [14]. To 
improve the performance of this model, we have rede- 
fined the prior distributions from a normal distribution to 
a beta distribution to improve the match to the peak shape 
similarity observations. 

The model for inferring the responses to covariates 
operates on clusters inferred in the first part. A Bayesian 
multi-way model [13] is suitable for this task. This model 
itself could be used for clustering summarized mass spec- 
tral intensity data, but in this work, we demonstrate that 
the clustering can be done more accurately based upon 
the similarity of chromatographic peak shapes. 

Methods 

This section describing the models consists of two parts: 
clustering of spectral peaks and inference of covariate 



effects. To maintain the mathematical rigor in the section, 
we use the terms "samples," "variables" and "clusters" to 
refer to the experimental runs of the mass spectrome- 
ter, the peaks in the mass spectrometry data, and the yet 
unknown compounds in the experimental runs, respec- 
tively. In the equations, we denote them by the indices 



/ =1, . . . , A/" (samples, i.e., experimental runs), 

7 =1, . . . , P (variables, i.e., peaks), 

k =1, . . . ,/<r (clusters, i.e., compounds). 



(1) 



respectively, where N, P and K are their respective total 
numbers. We use bold capital, bold non-capital and non- 
bold non-capital symbols to refer to matrices, vectors and 
scalars, respectively {e.g., V, v and v). 
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Clusters of peaks based on the similarity 

Following earlier work [14], we measure the similar- 
ity between the shapes of two peaks by their Pearson 
correlation computed over a window of retention time 
after a standard peak alignment [15] across the sam- 
ples. Truncating negative values to zero, this leads to 
a distinct similarity matrix Q/.. g[0, l]^^^ for each sam- 
ple /. In the notation, the operator indicates that 
the entire dimension of the array is included, not 
only the single item Since a peak is not necessarily 
observed in every sample, there may be missing values 
in the matrices. Therefore, we construct an additional 
mask R e {0, i]^^^^^ with binary values r/^y indicating 
whether the peak pair (j,f) in sample / appears together 
within the window where the similarity is measured and 
whether both of the peaks are observed. An unidentified 
peak may still be present in the sample below the limit of 
detection of the mass spectrometer. However, then it is 
not useful for the inference of covariate effects and, thus, 
is treated as missing. 

Model 

We assume that the peaks are generated through a 
Dirichlet process [16]: there is an unknown number 
of clusters and an unknown and variable number of 
peaks that arise from each of the clusters. Peaks are 
assumed to have a one-out-of-many association: each 
peak is associated with only one of the unknown clusters. 
With these basic assumptions, we can infer the P-hy-K 
clustering matrix V from the data Q. Value vp^ = 1 in 
the clustering matrix V assigns the peak ; to the clus- 
ter k. To make the following equations more com- 
pact, we use an additional variable, sj^ = v/.vj g {0, 1}, 
which is an inner product of the cluster indicator vec- 
tors of the peaks / and /, to denote whether the two 
peaks come from the same or different clusters (1 or 0, 
respectively). 

We set a spike-and-slab prior [17] for the peak shape 
similarity to model the inherent sparse structure of 
the data. The similarity between any pair of observed 
peaks (/,/) is assumed to follow a beta distribution, but 
the shape of the distribution is assumed to depend on 
whether the pair comes from the same cluster or from dif- 
ferent clusters (shape parameters (uin, hn) or (<^out> ^out)> 
when Sjf = 1 or 0, respectively). Also the probability 
of a missing similarity value is assumed to depend on 
the cluster assignment of the pair (p^^ or p^^^, when sjf = 
1 or 0, respectively). The distributional assumptions are 



^ijf (l-/^(r)^^ta {qijjf lain, hn) + P'q^ {njf), £jf = 1, 
^in' (1 -pT) Beta {qijr l^out, ^out) + pT^ {njj), sjf = 0, 

(2) 



with the first and the second row of the equation stating 
the distributions of a peak pair from the same cluster and 
different clusters, respectively. The likelihood of the entire 
peak shape data, 

P-1 P 

c (Q, Ri V) = n n n ^ i^^/o > (3) 

i=l j=l 

becomes a product over all peak pairs and samples follow- 
ing the distributional assumption of Equation 2. 

We further assume that the observed peaks are gener- 
ated from an unknown finite subset of an infinite set of 
clusters with an equal prior probability, 

p (su^ = l) = ^- , (4) 

for any pair of peaks to be generated from the same 
cluster. These assumptions define the Dirichlet process, 
controlled by the concentration parameter q^dP; which 
determines the prior probability mass outside the existing 
clusters. Following from this prior assumption, the proba- 
bility of assigning peak / to an existing cluster k, 

p {vjk = 1|Q, R, V_y,) a SkC (Q, R|V_y,, vjk = l) , (5) 

becomes weighted by the current size of the clus- 
ter, 5^ = v^^.^v_y;y^. In the notation, matrices V.-k and 
V_y,. correspond to the matrix V with the column k and 
the row / omitted, respectively. Alternatively, with proba- 
bility 

p (vy,/c+i = 1|Q,R, V) oc au^C (Q,R|V_y,, Vy,/c+l = 1) , 

(6) 

the process may create a new cluster with the index /C + 1 
and only the peak / inside. Then, the likelihood term is 
weighted by the Dirichlet process concentration parame- 
ter aDP> which can be seen as a pseudo-count for the num- 
ber of peaks outside the current K clusters. 

Inference 

We infer the posterior distribution of the clustering via 
Gibbs sampling, which results in a set of S samples of 
the clustering V^^\ s = 1, . . . , 5', from the true posterior 
distribution p (V|Q, R). The computational complexity of 
a Gibbs iteration is O (KP^), Further analysis can oper- 
ate on the entire posterior distribution of the clustering 
through integration, or on a point estimate of the distri- 
bution. We follow earlier work [18] and acquire a point 
estimate of the posterior distribution of the clustering 
through finding the least-squares clustering (Section 1 in 
Additional file 1). 

Covariate effects based on peak heights 

Having inferred the grouping of similar peaks into clusters 
that each correspond to a compound, we infer the dif- 
ferences in concentrations between sample groups for 
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each cluster given the peak height data X g R^^^ and 
the clustering V. Again, some values in the data may be 
missing. 

Model 

After a peak-specific centering based on the control 
group, the observed peak heights for each sample / are 
assumed to be normally distributed with a cluster-specific 



mean x 



lat. 



x,|V,xi^(r2-^Ar(vxif,A), 



(7) 



where the diagonal matrix A contains the peak-specific 
variance parameters ar^ e M^. The cluster-specific means 
are assumed to be normally distributed with a sample 
group-specific prior a, 



x^f |a,^,- -A/'(a.«,,l), 



(8) 



where Ui G {1, . . .,1^} is an indicator of group member- 
ship (covariate level) for sample / and I is a/C-by-ZC identity 
matrix. The corresponding covariate effects are arranged 
into an K-hy-La matrix a and the effects are assumed to 
be independent and normally distributed. 



a/ 



5(a./),/ = l 
A/" (0,1), 1 = 2,. 



(9) 



except for the first level, / = 1, which is defined as 
the baseline level and thus is fixed to zero. The model is 
not limited to one covariate: the cluster-specific mean x|.^^ 
can be expressed as a sum of effects of multiple covari- 
ates and their interaction effects (Section 1 in Additional 
file 1). Further, the model is readily extensible for depen- 
dent covariate effects [19]. 
The peak-specific variance parameter. 



a? Scale-Inv-x^ {no, ctq) , 



(10) 



follows a scaled inverse-/^ distribution with no prior 
samples and a scale cTq . 

Inference and analysis 

We infer the covariate effects via Gibbs sampling. Now 
the clustering matrix V has been learned earlier, and is 
thus taken as known in the model. Computational com- 
plexity of a Gibbs iteration is O (NPK^). The clustering 
and the covariate effects can be inferred overnight on 
a standard desktop computer for a typical-sized data set. 
The posterior distributions of the covariate effects a are 
descriptive of the differences between the sample groups 
and, thus, interesting from the analysis point of view. 
To assess the significance of the difference between a sam- 
ple group, c = I > If and the control group, c = 1, for 
a cluster we can study the posterior probability of 
the effect a^/ being greater or less than zero. 



Comparison methods 

We call the method described above Model 1. We com- 
pared the performance of the following approaches and 
refer to them as Models 1, 2 and 3: 

1. the multi-peak approach using both peak shape and 
height information, as proposed in this 

work (nonparametric clustering of peaks by their 
shape similarity, inference of covariate effects on 
the clusters based on the height of the peaks), 

2. the multi-peak approach using peak height 
information only [13] (clustering of peaks and 
inference of covariate effects based on the height of 
the peaks only), 

3. the typical single-peak approach (inference of 
covariate effects by the height of the strongest 
annotated peak only). 

For the studied real data sets, we discovered that peak 
height information alone is not enough for clustering 
the peaks into individual compounds due to the high level 
of noise and strong correlations between compounds. 
Thus, for real data we compared Model 1 to Model 3 
and highlight the benefit gained by using peak shape 
information. 

Model 2 assumes the generative Gaussian latent variable 
model of the Equations 7-10 for the intensity observations 
X and a uniform multinomial prior for the clustering of 
the peaks. The clustering is inferred by Gibbs sampling 
together with the covariate effects. 

Model 3 quantifies the difference between the covariate 
level, c = /, and the control level, c = 1, as the difference 
of their means based on the main peak /, 



1 ^ 1 ^ 



(11) 



The Kronecker delta function selects the samples 
that have the covariate level / by receiving the value 1, 
when Ui = /, and 0, otherwise. When the data are log- 
transformed, the mean difference corresponds to the fold 
change computed in many analysis platforms such as 
MZmine [15] and XCMS [6]. 

Experiments 

We demonstrate the performance of the proposed method 
through three experiments: a simulated data experiment, 
a spike-in benchmark experiment with known changes 
in concentrations, and a gene silencing experiment with 
measurements of the lipidome of cancer cells. 
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Evaluation measures 

Evaluation of the performance on real data sets is not 
a trivial task, as there is no ground truth available: nei- 
ther the identity of the peaks nor the true effect sizes are 
known. Thus, we also used spike-in data, where the true 
covariate effects are known, although only a small number 
of the peaks are annotated. 

For the simulated and benchmark experiments, we com- 
puted the mean squared error (MSE) between inferred 
and true covariate effects as an evaluation metric. 
As a result of the log-transformation of the intensity 
data, we were quantifying relative changes between sam- 
ple groups, independent of the average height of each 
peak. In the model, we thus assumed that the change 
is preserved across the peaks of one compound, in rela- 
tive terms. The significance of the difference in the MSE 
of the proposed approach and the comparison method 
was tested by the paired one-sided ^-test. The false dis- 
covery rate was controlled by the Benjamini-Hochberg 
step-up procedure [20]. Additionally for the simulated 
experiment, we studied the inference of the statistical sig- 
nificance of effects, since the true distribution of the data 
was known. 

To assess the sensitivity of the approaches to noise in 
natural lipidomic data lacking a ground truth, we used 
two types of indirect evaluation: First, we studied the con- 
sistency of the inferred covariate effects given a prior 
assumption about their similarity. Second, we examined 
the robustness of the inferred covariate effects to noise. 
Finally, we demonstrated differences between the multi- 
peak and single-peak approaches through examples of 
qualitative analysis of annotated peak clusters. 

Simulated data 

We started by investigating the performance of the pro- 
posed approach on synthetic data, where the true covari- 
ate effects are known. We focused on a usual task 
in exploratory analysis of biological data: the detection 
of significant non-zero covariate effects. We measured 
the performance by the accuracy at this task — the ratio 
of true positive and true negative significant differences 
among all effects. We used the 95% posterior quantiles 
to determine the significance. Additionally, we compared 
the approaches by the MSE to the true effects and studied 
the performance of the two clustering models by com- 
puting the normalized information distance (NID) [21] to 
the true clustering. 

The approaches were tested across a set of potential 
experimental settings to study how the observation of 
additional peaks and samples affects the performance. 
Simulated data were generated by assuming the latent 
structure of Model 1. The following data parameters were 
varied on a grid: sample-size A/^ = 2 x {3, 7, 15} and peak- 
specific noise = {1,5}. Additionally, the number of 



peaks per cluster was varied between 3, 7 and 15. Covari- 
ate effects a. 2 =[2,-1,0.5,0,0,0,0] were generated for 
each data set. The experiment was repeated 100 times 
with independent data sets. The results are reported in 
the Results and discussion section. 

Benchmark data with known changes in concentrations 

The benchmark data set of apple samples [22] includes 
a set of annotated spike-in compounds with increases 
of 20, 40 or 100% in concentrations. We started with 
the raw spectral data [23] in order to acquire the shapes 
of the peaks in addition to their heights. The mass spec- 
tra were pre-processed using MZmine 2 [15] (Section 4 
in Additional file 1). We used standard pre-processing 
methodology also used in the original publications of 
the data sets, thus maintaining the focus of the work 
on the potential benefit gained from the multiple peaks. 
The compared approaches were on the same line in terms 
of the data. 

We evaluated the approaches by the MSE between 
inferred and true covariate effects. If the cluster con- 
tained multiple annotated peaks, the effect of each anno- 
tated peak was evaluated separately for the single-peak 
approach. Clusters with no annotated peaks were consid- 
ered to have a 0% true effect and the effect of the single- 
peak approach was inferred based on the strongest peak 
of the cluster. 

Lipidomic data from a gene silencing study 

The data come from a recent experiment [24] to study 
the effects of gene silencing treatments on lipidomic pro- 
files and growth of breast cancer tissue. Driven by the lack 
of ground truth about the covariate effects, we evaluated 
the inferred effects indirectly in two ways: (1) by quanti- 
fying the consistency of the effects within a lipid family 
and (2) by quantifying the robustness of the magnitudes 
of the inferred effects across the lipidome in presence of 
additional noise. Additionally, we investigated the stabil- 
ity of the inferred clustering on the data and qualitatively 
analyzed differences between the covariate effects of sin- 
gle peaks and the effects inferred on clusters of peaks by 
Model 1. 

The data included 32 lipidomic profiles of breast can- 
cer cells from the ZR-75-1 cell line. We inferred the effects 
of seven distinct silencing interventions on metabolism- 
regulating genes (Section 5 in Additional file 1) at two 
time points. The raw spectra were pre-processed with 
MZmine 2 as described in the original publication [24], 
in addition to which the shape similarities of the peaks 
were computed. 

Consistency of effect signs* In the first task, we quan- 
tified the consistency as the accuracy at predicting 
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the covariate effect of a test lipid given the model on 
the covariate effects of other lipids of the same family. 
For instance, we predicted the effect of a gene silenc- 
ing treatment on the sphingomyelin SM(dl8:l/22:0) based 
on the sphingomyelin compounds in the training set. We 
examined the sign of the effect instead of the absolute 
effect, since even within a family of lipids the changes have 
a high variance and thus cannot be reliably predicted with- 
out imposing additional information about the biological 
system. 

We predicted the signs of the covariate effects for test 
lipids in a three-fold cross-validation setting with 100 ran- 
domizations. The examined lipids included the annotated 
members from the three most abundant families of lipids 
that had two or more peaks identified with the clustering 
model (Section 5 in Additional file 1). 



Further, we studied the influence of noise to the con- 
sistency by adding independent normally distributed 
noise (from a = 0 to a = 10) on the peak intensity 
observations. Added noise variance a = 1 was equal 
to the existing original variance in the data, and 
the upper bound for the signal-to-noise ratio then was 0.5 
(Additional file 1: Table S4). 

Robustness of effect magnitudes. To evaluate the 
inferred effects at the scale of the entire observed 
lipidome, we examined the consistency of inferred covari- 
ate effects between the original and noise-added data 
sets. This experiment simulated the situation where 
the true effects are known (effects from the original 
data set), but the data based on which the effects are 
inferred are noisy (the added-noise data set). To measure 
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Figure 4 The use of data from multiple peaks and the peak shape information increased the accuracy at detecting significant covariate 
effects on simulated data. Accuracy of Models 1,2 and 3 for simulated data is shown as a function of the sample-size in two settings: normal and 
high level of noise (left: a ^ = l,and right: = 5, respectively). Top (a-b): Accuracy at inferring the significance of the generated covariate effects. 
Bottom (c-d): Normalized information distance (NID) between the inferred and the true clustering. An entirely random and an exactly correct 
clustering correspond to a NID of 1 and 0, respectively. 
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Figure 5 The performance of Model 1 improved when more 
pealcs per compound were available in the simulated data. 

The curves show the accuracy as a function of sample-size for 
simulated data with 1 5, 7 and 3 peaks per compound. 



the consistency, we computed the Spearman correlation 
between the covariate effects inferred from the origi- 
nal and the added-noise data sets. We studied all clus- 
ters with two or more peaks, constituting 20% of the 
clusters. 

Results and discussion 

Simulated data 

On a normal level of noise (a^ = 1), the multi-peak 
approaches (Models 1 and 2) always performed better 
at detecting significant covariate effects than the single- 
peak approach (Model 3; Figure 4a) and only with enough 
samples the performance of Model 2 became compa- 
rable to Model 1. The inferred clustering of Model 1 
was perfect while the clustering performance of Model 2 
heavily depended on the number of samples available 
(Figure 4c). 

On a high level of noise {<7^ = 5), only Model 1 
worked (Figure 4b). The reason for the failure of Model 2 
was the imperfectly inferred clustering (Figure 4d). 
The good performance of Model 1 resulted from the clus- 
tering step, which is robust to noise in the peak heights. 
The peak shape similarity gave strong evidence for infer- 
ring the clusters already from a single sample. 

The MSE between the inferred and true covariate 
effects for Model 1 was smaller compared to Model 3 
in all the 24 setups of the experimental grid (Additional 
file 1: Table SI). The difference was statistically signifi- 
cant in 22 setups and in all setups at the high level of 
noise. 

The performance of Model 1 clearly improved, 
when more peaks from a cluster were present in 
the data (Figure 5). This was pronounced at a high level 
of noise, when the observation of a single peak is unre- 
liable for inferring the covariate effects. In a similar way 
as in averaging over samples, the model is able to over- 
come peak-specific noise also by averaging over multiple 
peaks. 

Benchmark data with known changes in concentrations 

In the first demonstration on real UPLC-MS data [22], we 
show that Model 1 can infer the artificial perturbations 
in a spike-in experiment more accurately than the single- 
peak approach. 

In the positive ion mode, the model inferred 794 clus- 
ters, among which 135 clusters included more than 
one peak. Seven clusters included annotated peaks from 
the spike-in compounds, four of which included more 
than one annotated peak (Additional file 1: Table S2). 
Peaks from two compounds were distributed to two and 
four clusters, respectively. In the negative ion mode, 
the model inferred 367 clusters, among which 113 clusters 
were non-singletons. Three clusters included annotated 
peaks from the spike-in compounds, all of these clusters 



included more than one annotated peak and all peaks 
from one compound were clustered together. In both 
the ion modes, all clusters with annotated peaks were 
specific to one compound. 

Model 1 had a lower error than Model 3 at all 
magnitudes of the true effect with the strongest rel- 
ative improvement occurring at the small magni- 
tudes (Figure 6). The difference was statistically significant 
for covariate effects from 0 to 40% (Additional file 1: 
Table S3). 

Lipidomic data from a gene silencing study 

In the second demonstration on real UPLC-MS data [24], 
we show that Model 1 can infer more consistent covari- 
ate effects in two ways even though the true effects are 
unknown. 

Consistency and robustness of effects 

When examining the consistency of effects within a lipid 
family. Model 1 was more consistent than Model 3 at all 
levels of noise (Figure 7). When no noise was added and 
also at moderate levels of noise, both approaches per- 
formed clearly better than expected by random chance. 
When noise was added. Model 3 suffered more and its 
performance reduced to the random level more rapidly. 
Given the assumption about the general similarity of lipids 
within a family is true. Model 1 inferred the covariate 
effects more consistently. 
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Error of the inferred effect 
as a function of the true effect 
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Figure 6 Model 1 had a more accurate quantification of 

the covariate effects for the spilce-in compounds as well as for 

the unchanged non-annotated compounds in the benchmark 

experiment. Root-mean-square error (RMSE;y-axis) between 

the inferred and true covariate effects is smaller for Model 1 (All 

peaks) than for the single-peak approach (Single peak) at all 

the magnitudes of the true effect (x-axis). Differences were statistically 

significant for changes of 0 to 40% (Additional file 1 : Table S3). 



Predictive performance 
as a function of added noise 
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Figure 7 Model 1 (All peaks) had a better accuracy at 
the prediction of signs of covariate effects for previously unseen 
lipids in the lipidomic gene-silencing data set compared to 
Model 3 (Single peak). The difference became pronounced when 
simulated noise was added to the data. The prediction was based on 
the inferred covariate effects of compounds from the same lipid family 
and was done in a cross-validation setting. In the task, the effects of 
the seven gene-silencing treatments were predicted on the three 
most abundant families of lipids in two time points. Points a = o 
and or > 0 on the x-axis show the prediction accuracy (y-axis) for 
the original data and the data with added noise, respectively. 



When examining the robustness of effect magnitudes, 
Model 1 was more consistent than Model 3 when noise 
was added to the data (Figure 8). The confidence intervals 
from the 100 randomizations did not overlap at all at 
moderate levels of noise. 

StabWWy 

Since the proposed approach is sensitive to the inferred 
clustering of the data, we analyzed the stability of 
the inferred clustering on biological data, using 
the lipidomic gene silencing data as a case study. We 
tested the influence of the concentration parameter q^dp 
in the Dirichlet process clustering model. The clustering 
result for the lipidomic gene silencing data was robust to 
changes in the magnitude of the concentration parame- 
ter (Additional file 1: Figure S2). As expected, the number 
of clusters increased, when the preset value of the con- 
centration parameter increased, but the relative change 
was small. 

Qualitative analysis 

Finally, we give concrete examples of potential findings 
that the approaches can uncover and demonstrate how 
analysis based on a single peak may lead to a different 
outcome depending on the choice of the peak. 

The intervention-driven changes of individual peaks 
from two lipids along with the covariate effects inferred 



Spearman correlation between 

covariate effects inferred 
from noisy and noise-free data 
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Figure 8 The covariate effects inferred by Model 1 (All peaks) 
were more robust to noise compared to Model 3 (Single peak). 

At moderate levels of noise, which is the regime of many biological 
experiments, the confidence intervals over 1 00 randomizations did 
not overlap at all. The robustness was quantified as the Spearman 
correlation (y-axis) between the effects inferred from the noisy and 
noise-free versions of the lipidomic gene silencing data set as 
a function of the level of noise (x-axis). 
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by Models 1 and 3 are shown in Figure 9. In the case 
of the sphingomyelin SM(dl8:l/22:0), there were strong 
covariate effects inferred by Model 3 but many of these 
effects became weaker when inferred based on multiple 
peaks by Model 1. On the contrary, Model 3 inferred weak 
covariate effects for the ceramide Cer(dl8:l/17:0) but 
based on multiple peaks and Model 1, one of the effects 
was actually among the top-5% strongest effects across 
the observed lipidome. 

Conclusions 

We have empirically demonstrated that a model-based 
integration of multiple peaks can lead to an improved 
accuracy in the inference of covariate effects, and 
we introduced a novel method for this task. While 
the sample-size is always restricted by external constraints 
such as the experiment budget or the availability of 



suitable patients, the inference based on multiple peaks 
gives a shortcut to extracting more information from 
the limited set of samples, thereby directly addressing 
the "small n, large p" problem. However, some types 
of systematic measurement error, such as some batch 
effects, escape this treatment and can only be reduced by 
introducing independent replicates. Based on the results 
presented in this work, we argue that additional peaks 
are especially useful when the signal-to-noise ratio is 
low and the differences between sample groups are 
small. 

We suggest that all the detected peaks that can be asso- 
ciated with a compound should be taken into account in 
the comparative analysis. This is possible through the two- 
step generative modeling approach presented in this work: 
(1) by identifying the peaks that can be associated with 
one compound through clustering the peaks based on 
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Figure 9 Example clusters of peaks from the lipidomic gene silencing data with differences in the covariate effects inferred based on 
a single peak and multiple peaks. The heat maps show changes in the lipid concentrations driven by the gene silencing interventions (columns). 
Covariate effects inferred by Models 3 and 1 using a single peak and all peaks, respectively, are shown on the two bottom rows of each heat map. 
The log2 fold changes of each peak associated with the compound are shown on the top rows. Changes that by the magnitude fall to the top-5% 
across the entire observed lipidome are highlighted by the symbol "T."Top (a): The sphingomyelin SM(dl 8:1 -22:0) with three peaks. Many strong 
changes for SM(dl 8:1 -22:0) became weaker when they were inferred based on all three peaks. Bottom (b):The ceramide Cer(d 18:1-1 7:0) with six 
peaks. The effect of the SCAP silencing for Cer(d 18:1-1 7:0) at 72 hours became strong when it was inferred based on all six peaks. 
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their shape similarity and (2) by the inference of covariate 
effects on the clusters, each representing one compound. 

Additional file 



Additional file 1 : Supplementary material. More details of 
the experiments. 
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